## Version upload to GitHub
The Iowa Department of Commerce requires that every store that sells alcohol in bottled form for off-the-premises consumption must hold a class "E" liquor license (an arrangement typical of most of the state alcohol regulatory bodies). All alcoholic sales made by stores registered thusly with the Iowa Department of Commerce are logged in the Commerce department system, which is in turn published as open data by the State of Iowa. Data Source : https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales?project=fiery-province-334423
https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy
The Dataset is about 6GB on the Google BigQuery. The is total 22,716,731 rows.
On the basis of Sales data, I am going to find the answer to the following question
| Column Name | Description | Data Type |
|---|---|---|
| Invoice/Item Number | Concatenated invoice and line number associated with the liquor order. | Plain Text |
| Date | Date of Order | Date/Time |
| Store Number | Unique number assigned to the store who ordered the liquor. | Plain Text |
| Store Name | Name of store who ordered the liquor. | Plain Text |
| City | City where the store who ordered the liquor is located | Plain Text |
| Zip Code | Zip Code where the store who ordered the liquor is located | Plain Text |
| Store Location | Location of store who ordered the liquor. | Plain Text |
| County Number | Iowa county number for the county where store who ordered the liquor is located | Plain Text |
| County | County where the store who ordered the liquor is located | Plain Text |
| Category | Category code associated with the liquor ordered | Plain Text |
| Category Name | Category of the liquor ordered | Plain Text |
| Vendor Number | The vendor number of the company for the brand of liquor ordered | Plain Text |
| Vendor Name | The vendor name of the company for the brand of liquor ordered | Plain Text |
| Item Number | Item number for the individual liquor product ordered. | Plain Text |
| Item Description | Description of the individual liquor product ordered. | Plain Text |
| Pack | The number of bottles in a case for the liquor ordered | Number |
| Bottle Volume (ml) | Volume of each liquor bottle ordered in milliliters. | Number |
| State Bottle Cost | The amount that Alcoholic Beverages Division paid for each bottle of liquor ordered | Number |
| State Bottle Retail | The amount the store paid for each bottle of liquor ordered | Number |
| Bottles Sold | The amount the store paid for each bottle of liquor ordered | Number |
| Sale (Dollars) | Total cost of liquor order (number of bottles multiplied by the state bottle retail) | Number |
| Volume Sold (Liters) | Total volume of liquor ordered in liters. (i.e. (Bottle Volume (ml) x Bottles Sold)/1,000) | Number |
| Volume Sold (Gallons | Total volume of liquor ordered in gallons. (i.e. (Bottle Volume (ml) x Bottles Sold)/3785.411784) | Number |
*Store Location:
The Address, City, State and Zip Code are geocoded to provide geographic coordinates. Accuracy of geocoding is dependent on how well the address is interpreted and the completeness of the reference data used.
#install
# pip install plotly==5.4.0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
import plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.arima_model import ARIMA
#pip install progressbar
from progressbar import *
from sklearn.preprocessing import LabelEncoder
from math import sqrt
import plotly.graph_objects as go
import plotly.express as px
import plotly as py
pd.set_option('display.max_rows',50)
pd.set_option('display.max_columns',150)
import matplotlib.pyplot as plt
#pip install datetime
from datetime import datetime as dt
#graph design
#pip install dash
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.graph_objects as go
C:\Users\MELODY~1\AppData\Local\Temp/ipykernel_16572/2807057372.py:40: UserWarning: The dash_core_components package is deprecated. Please replace `import dash_core_components as dcc` with `from dash import dcc` C:\Users\MELODY~1\AppData\Local\Temp/ipykernel_16572/2807057372.py:41: UserWarning: The dash_html_components package is deprecated. Please replace `import dash_html_components as html` with `from dash import html`
#For export the plotly graph to chart_studio
import chart_studio
username = 'melodyyip' # your username
api_key = 'IcyruGDTrXTRHU98hvix' # your api key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
After looking at the data, I would study the general picture of the total_sale. The only columns we are interested in are; 'year', 'month', 'date' and 'sale_dollars'. So we use read_gbq() to extract a summarized data
from pandas.io import gbq
df_totalsale = gbq.read_gbq('SELECT date,extract(year from date) as year,ROUND(SUM(sale_dollars),2) as sale_dollars FROM bigquery-public-data.iowa_liquor_sales.sales where extract(year from date)>2016 group by 1,2', project_id='iowa-liquor-retail-sale', reauth='true')
df_totalsale.head(10)
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=725825577420-unm2gnkiprugilg743tkbig250f4sfsj.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&state=ohD2ybYr7cTWZ8opitWfDnrwaJR7hL&prompt=consent&access_type=offline Enter the authorization code: 4/1AX4XfWgIajBT5QN5dzjO3MQg8UjwBRleT06fXHjMcKOB_Bkd8sDeKTqt0Pw
| date | year | sale_dollars | |
|---|---|---|---|
| 0 | 2021-01-11 | 2021 | 1439213.48 |
| 1 | 2022-01-31 | 2022 | 1071259.33 |
| 2 | 2021-08-04 | 2021 | 1782963.46 |
| 3 | 2019-03-12 | 2019 | 1572739.14 |
| 4 | 2018-10-02 | 2018 | 2071385.26 |
| 5 | 2018-09-27 | 2018 | 1296045.38 |
| 6 | 2021-02-01 | 2021 | 1505470.88 |
| 7 | 2019-11-26 | 2019 | 1671031.65 |
| 8 | 2021-05-04 | 2021 | 1882145.38 |
| 9 | 2017-09-05 | 2017 | 1357584.11 |
import calendar
df_totalsale['year'] = df_totalsale['date'].apply(lambda x: x.strftime('%Y')).astype(int)
df_totalsale['year_month'] = df_totalsale['date'].apply(lambda x: x.strftime('%Y-%m'))
df_totalsale['month'] = df_totalsale['date'].apply(lambda x: x.strftime('%m')).astype(int)
df_totalsale['month_'] = df_totalsale['month'].apply(lambda x: calendar.month_name[x])
import plotly.express as px
dfg = df_totalsale[['date','sale_dollars']]
dfg = dfg.groupby(pd.Grouper(key='date',freq='M')).sum().reset_index()
dfg.rename(columns = {"sale_dollars": "sum"}, inplace=True)
fig = px.line(dfg, x="date", y="sum"
,hover_data={"date": "|%B %d, %Y"}
,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
)
fig.update_layout(
title="Total Sale Over Time(2016-2020)"
,xaxis_title="date"
,yaxis_title="sale_dollars"
)
fig.update_xaxes(
dtick="M1",
tickformat="%b\n%Y",
ticklabelmode="period")
fig.show("")
fig.write_html("IowaWaterfall-fig1")
#Export plotly to chart_studio
import chart_studio.plotly as py
py.plot(fig, filename = 'IowaWaterfall-fig1', auto_open=True)
'https://plotly.com/~melodyyip/31/'
The line chart shows changes in the iowa liquor sale since Jan 2012 to Nov 2021. It shows a slow and steady increase over the years. There is also an obvious seasonal trend. Therefore, I plot the multiple line chart below for further investigation.
#import seaborn as sns
import calendar
# given some dataframe, perform groupby and reset the index
dfg = df_totalsale.groupby(['year', 'month']).agg({'sale_dollars': sum}).reset_index()
dfg['month'] = dfg['month'].apply(lambda x: calendar.month_name[x])
import plotly.express as px
dfg = df_totalsale[['month_','year','sale_dollars']]
dfg = dfg.groupby(['month_','year']).sum().reset_index()
dfg.rename(columns = {"sale_dollars": "sum"}, inplace=True)
fig = px.line(dfg, x="month_", y="sum"
# ,hover_data={"date": "|%B %d, %Y"}
,color='year'
,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
)
fig.update_layout(
title="Total Sale Over Time(2016-2020)"
,xaxis_title="year_month"
,yaxis_title="sale_dollars"
)
fig.update_xaxes(
dtick="M1",
# tickformat="%b\n%Y",
ticklabelmode="period")
fig.show()
fig.write_html("IowaWaterfall-fig2")
#Export plotly to chart_studio
import chart_studio.plotly as py
py.plot(fig, filename = 'IowaWaterfall-fig2', auto_open=True)
'https://plotly.com/~melodyyip/37/'
From the graph, it can be observed that there is a cave-shaped curve every Oct to Dec. There is also significatly large sales in Oct and Dec for 2012 to 2018. However, for the recent year, it seems the customer would like to purchase liquors on March, June and December
#manipulation data for waterfall chart
from pandas.io import gbq
MoM_Data = df_totalsale.loc[df_totalsale['year'] > 2020]
MoM_Data = MoM_Data[['year_month','sale_dollars']]
MoM_Data = MoM_Data.groupby(['year_month']).sum().reset_index()
MoM_Data.rename(columns = {"sale_dollars": "sum"}, inplace=True)
MoM_Data = MoM_Data.sort_values(by=['year_month'], ascending=True)
MoM_Data.head()
| year_month | sum | |
|---|---|---|
| 0 | 2021-01 | 28665659.86 |
| 1 | 2021-02 | 29617805.95 |
| 2 | 2021-03 | 36654828.86 |
| 3 | 2021-04 | 34734921.83 |
| 4 | 2021-05 | 33659554.94 |
MoM_Data['Last_Month'] = np.roll(MoM_Data['sum'],1)
# MoM_Data['MoMGrowth'] = (MoM_Data['sum']/MoM_Data['Last_Month'])-1
MoM_Data['MoMGrowth_amount'] = np.select(
[
MoM_Data['year_month'] == '2021-01'
,MoM_Data['year_month'] == '2022-01'
],
[
MoM_Data['sum']
,0
# ,MoM_Data['sum']
],
default= MoM_Data['sum'] - MoM_Data['Last_Month']
).round(-4)
MoM_Data['MoMGrowth_text'] = np.select(
[
MoM_Data['year_month'] == '2021-01'
,MoM_Data['year_month'] == '2022-01'
],
[
MoM_Data['sum']
,MoM_Data['sum']
],
default= MoM_Data['sum'] - MoM_Data['Last_Month']
).round(-4)
# !pip install waterfallcharts
# import waterfall_chart
# import matplotlib.pyplot as plt
# # plt.rcParams["figure.figsize"] = (10,6)
# y = list(MoM_Data['year_month'].values.flatten())
# x = list(MoM_Data['MoMGrowth_amount'].values.flatten())
# x = [element /1000000 for element in x]
# waterfall_chart.plot(y, x,sorted_value=False,net_label='2022-01',rotation_value=90)
# # plt.xticks(fontsize=10)
# # plt.yticks(fontsize=10)
# # plt.ylabel('Sale', fontsize=10)
# # sns.despine()
# # Add Title
# plt.title("YoY Sales Change(2021-2022) - using matplotlib")
# # Add Axes Labels
# # plt.xlabel("X-axis")
# plt.ylabel("Sale")
# plt.ylim(bottom=0)
# plt.show()
import plotly.graph_objects as go
text = list(MoM_Data['MoMGrowth_text'].values.flatten())
text = [element /1000000 for element in text]
fig = go.Figure(go.Waterfall(
name = "20", orientation = "v",
measure = ["relative", "relative", "relative","relative","relative", "relative", "relative", "relative", "relative","relative", "relative", "relative", "total"],
x = list(MoM_Data['year_month'].values.flatten()),
textposition = "outside",
text = text,
y = list(MoM_Data['MoMGrowth_amount'].values.flatten()),
connector = {"line":{"color":"rgb(63, 63, 63)"}},
))
fig.update_layout(
title = "YoY Sales Change(2021-2022)",
showlegend = False,
# uniformtext_minsize=9, uniformtext_mode='hide'
# width=850, height=600,
)
fig.show()
fig.write_html("IowaWaterfall-fig3")
#Export plotly to chart_studio
import chart_studio.plotly as py
py.plot(fig, filename = 'IowaWaterfall-fig3', auto_open=True)
'https://plotly.com/~melodyyip/41/'
#YoY Sale Change by category
from pandas.io import gbq
df = gbq.read_gbq('SELECT date,extract(year from date) as year,category,ROUND(SUM(sale_dollars),2) as sale_dollars FROM bigquery-public-data.iowa_liquor_sales.sales where extract(year from date) > 2017 and extract(year from date) < 2020 group by 1,2,3', project_id='iowa-liquor-retail-sale', reauth='true')
df.head(10)
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=725825577420-unm2gnkiprugilg743tkbig250f4sfsj.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&state=adh9RAVEHYcYyjG2Fi0V4vDStMddr5&prompt=consent&access_type=offline Enter the authorization code: 4/1AX4XfWiNo4OU1LFAMKrZMu9jT2j3XzoD2XcgSOT4KlJ_5AvwP-SlXUpy-w4
| date | year | category | sale_dollars | |
|---|---|---|---|---|
| 0 | 2018-08-09 | 2018 | 1011600.0 | 9668.30 |
| 1 | 2018-08-21 | 2018 | 1081100.0 | 5479.76 |
| 2 | 2018-10-25 | 2018 | 1092100.0 | 11758.20 |
| 3 | 2018-08-27 | 2018 | 1081100.0 | 8659.70 |
| 4 | 2018-10-09 | 2018 | 1012300.0 | 26186.93 |
| 5 | 2018-10-09 | 2018 | 1081500.0 | 2683.67 |
| 6 | 2018-08-20 | 2018 | 1081500.0 | 4107.87 |
| 7 | 2018-08-13 | 2018 | 1081100.0 | 4603.81 |
| 8 | 2018-10-16 | 2018 | 1081500.0 | 2755.96 |
| 9 | 2018-08-29 | 2018 | 1901200.0 | 10308.76 |
df['year'] = df['date'].apply(lambda x: x.strftime('%Y')).astype(int)
# df['year_month'] = df['date'].apply(lambda x: x.strftime('%Y-%m'))
#Group the Liquor category
df['category_group'] = df['category'].str[:3]
category_data = [['101', 'whiskies'],
['102', 'mezcal/tequila'],
['103', 'vodka'],
['104', 'gins'],
['105', 'brandies'],
['106', 'rum'],
['107', 'cocktails/rtd'],
['108', 'liqueur/schnapps'],
['109', 'spirits'],
['110', 'americanAlcohol'],
['150', 'americanHighProofBeer'],
['170', 'temporary&specialtyPackages'],
['190', 'specialOrderItems']]
category_group_table = pd.DataFrame(category_data, columns = ['category_group', 'category_group_name'])
#Left join the table 'df' with lookup table 'category_group_table'
left_join = pd.merge(df,
category_group_table,
on ='category_group',
how ='left')
#Aggregate the sale of liquor category group
Cat_Data = df[['year','category_group','sale_dollars']]
Cat_Data = Cat_Data.groupby(['year','category_group']).sum().reset_index()
Cat_Data.rename(columns = {"sale_dollars": "sum"}, inplace=True)
Cat_Data = Cat_Data.sort_values(by=['category_group','year'], ascending=True)
Cat_Data.head()
| year | category_group | sum | |
|---|---|---|---|
| 0 | 2018 | 101 | 1.061550e+08 |
| 11 | 2019 | 101 | 1.119310e+08 |
| 1 | 2018 | 102 | 2.004997e+07 |
| 12 | 2019 | 102 | 2.195227e+07 |
| 2 | 2018 | 103 | 8.278324e+07 |
Cat_DataPivot = Cat_Data.pivot(*Cat_Data).reset_index()
Cat_DataPivot
| category_group | year | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 170 | 190 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | 1.061550e+08 | 20049974.43 | 82783239.89 | 8051070.18 | 16014965.14 | 37634131.88 | 5355731.28 | 48456862.63 | 2633861.34 | 4863967.07 | 1737404.10 |
| 1 | 2019 | 1.119310e+08 | 21952271.76 | 83199851.92 | 8305399.30 | 17160863.48 | 36973850.47 | 5507831.27 | 50972226.76 | 2820265.56 | 8144156.41 | 1912236.03 |
Cat_Data['last_sum'] = np.roll(Cat_Data['sum'],1)
Cat_Data['last_category'] = np.roll(Cat_Data['category_group'],1)
Cat_Data.head()
| year | category_group | sum | last_sum | last_category | |
|---|---|---|---|---|---|
| 0 | 2018 | 101 | 1.061550e+08 | 1.912236e+06 | 190 |
| 11 | 2019 | 101 | 1.119310e+08 | 1.061550e+08 | 101 |
| 1 | 2018 | 102 | 2.004997e+07 | 1.119310e+08 | 101 |
| 12 | 2019 | 102 | 2.195227e+07 | 2.004997e+07 | 102 |
| 2 | 2018 | 103 | 8.278324e+07 | 2.195227e+07 | 102 |
#calculate the grow by category
Cat_Data.loc[Cat_Data['category_group'] == Cat_Data['last_category'], 'text%'] = (((Cat_Data['sum']/Cat_Data['last_sum'])-1)*100)
Cat_Data.loc[Cat_Data['category_group'] == Cat_Data['last_category'], 'grow'] = Cat_Data['sum'] - Cat_Data['last_sum']
Cat_Data = Cat_Data.loc[Cat_Data['year'] == 2019]
Cat_Data
| year | category_group | sum | last_sum | last_category | text% | grow | |
|---|---|---|---|---|---|---|---|
| 11 | 2019 | 101 | 1.119310e+08 | 1.061550e+08 | 101 | 5.441125 | 5776024.49 |
| 12 | 2019 | 102 | 2.195227e+07 | 2.004997e+07 | 102 | 9.487779 | 1902297.33 |
| 13 | 2019 | 103 | 8.319985e+07 | 8.278324e+07 | 103 | 0.503256 | 416612.03 |
| 14 | 2019 | 104 | 8.305399e+06 | 8.051070e+06 | 104 | 3.158948 | 254329.12 |
| 15 | 2019 | 105 | 1.716086e+07 | 1.601497e+07 | 105 | 7.155172 | 1145898.34 |
| 16 | 2019 | 106 | 3.697385e+07 | 3.763413e+07 | 106 | -1.754475 | -660281.41 |
| 17 | 2019 | 107 | 5.507831e+06 | 5.355731e+06 | 107 | 2.839948 | 152099.99 |
| 18 | 2019 | 108 | 5.097223e+07 | 4.845686e+07 | 108 | 5.190935 | 2515364.13 |
| 19 | 2019 | 109 | 2.820266e+06 | 2.633861e+06 | 109 | 7.077222 | 186404.22 |
| 20 | 2019 | 170 | 8.144156e+06 | 4.863967e+06 | 170 | 67.438560 | 3280189.34 |
| 21 | 2019 | 190 | 1.912236e+06 | 1.737404e+06 | 190 | 10.062825 | 174831.93 |
dff = df.groupby(['year'])['sale_dollars'].sum().reset_index()
dff
| year | sale_dollars | |
|---|---|---|
| 0 | 2018 | 3.342168e+08 |
| 1 | 2019 | 3.492203e+08 |
dff = df.groupby(['year'])['sale_dollars'].sum().reset_index()
ly = dff.iat[0,1]
cy = dff.iat[-1,1]
y = list(Cat_Data['grow'].values.flatten())
y.insert(0,ly)
y.insert(len(y),cy)
import plotly.graph_objects as go
text = list(Cat_Data['text%'].values.flatten())
text = [round(element,2) for element in text]
text = [str(int)+'%' for int in text]
text.insert(0, "")
text.insert(len(text),"" )
text
fig = go.Figure(go.Waterfall(
name = "20", orientation = "v",
measure = ["relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"relative"
,"total"],
x = [['Total'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Distilled Drinks'
,'Undistilled Drinks'
,'Undistilled Drinks'
, 'Total '],
['2021'
,'whiskies'
,'mezcal/tequila'
,'vodka'
,'gins'
,'brandies'
,'rum'
,'cocktails/rtd'
,'liqueur/schnapps'
,'spirits'
,'temporary'
,'specialOrderItems'
,'2022']],
textposition = "outside",
text = text,
y = y,
connector = {"line":{"color":"rgb(63, 63, 63)"}},
))
fig.update_layout(
title = "Liquor Sales Growth from 2021 to 2022",
showlegend = False,
# uniformtext_minsize=9, uniformtext_mode='hide'
width=950, height=700
,yaxis_range=[ly-10000000,cy+4000000]
)
fig.show()
fig.write_html("IowaWaterfall-fig4")
#Export plotly to chart_studio
import chart_studio.plotly as py
py.plot(fig, filename = 'IowaWaterfall-fig4', auto_open=True)
'https://plotly.com/~melodyyip/43/'
!jupyter nbconvert --to html Iowa_financial_analysis.ipynb
[NbConvertApp] Converting notebook Iowa_financial_analysis.ipynb to html [NbConvertApp] Writing 4350748 bytes to Iowa_financial_analysis.html